Abstract

Recursive neural models, which use syntactic parse trees to recursively generate representations bottom-up, are a popular architecture. But there have not been rigorous evaluations showing for exactly which tasks this syntax-based method is appropriate. In this paper we benchmark recursive neural models against sequential recurrent neural models (simple recurrent and LSTM models), enforcing apples-to-apples comparison as much as possible. We investigate 4 tasks: (1) sentiment classification at the sentence level and phrase level; (2) matching questions to answer-phrases; (3) discourse parsing; (4) semantic relation extraction (e.g., component-whole between nouns).

Our goal is to understand better when, and why, recursive models can outperform simpler models. We find that recursive models help mainly on tasks (like semantic relation extraction) that require associating headwords across a long distance, particularly on very long sequences.

We then introduce a method for allowing recurrent models to achieve similar performance: breaking long sentences into clause-like units at punctuation and processing them separately before combining. Our results thus help understand the limitations of both classes of models, and suggest directions for improving recurrent models.

1 Introduction 

Deep learning based methods learn low-dimensional, real-valued vectors for word tokens, mostly from large-scale corpora (e.g., Mikolov et al., 2013; Le and Mikolov, 2014; Collobert et al., 2011), successfully capturing syntactic and semantic aspects of text.

For tasks where the inputs are larger text units (e.g., phrases, sentences or documents), a compositional model is first needed to aggregate tokens into a vector with fixed dimensionality that can be used as a feature for other NLP tasks. Models for achieving this usually fall into two categories: recurrent models and recursive models.

Recurrent models (also referred to as sequence models) deal successfully with time-series data (Pearlmutter, 1989; Dorffner, 1996) like speech (Robinson et al., 1996; Lippmann, 1989; Graves et al., 2013) or handwriting recognition (Graves and Schmidhuber, 2009; Graves, 2012). They were applied early on to NLP (Elman, 1990), modeling a sentence as tokens processed sequentially, at each step combining the current token with previously built embeddings. Recurrent models can be extended to bidirectional ones that process the sequence both left-to-right and right-to-left. These models generally consider no linguistic structure aside from word order.

Recursive neural models (also referred to as tree models), by contrast, are structured by syntactic parse trees. Instead of considering tokens sequentially, recursive models combine neighbors based on the recursive structure of parse trees, starting from the leaves and proceeding recursively in a bottom-up fashion until the root of the parse tree is reached. For example, for the phrase "the food is delicious", they follow the operation sequence ((the food) (is delicious)) rather than the sequential order (((the food) is) delicious). Many recursive models have been proposed (e.g., Paulus et al., 2014; Irsoy and Cardie, 2014), and applied to various NLP tasks, among them entailment (Bowman, 2013; Bowman et al., 2014), sentiment analysis (Socher et al., 2013; Irsoy and Cardie, 2013; Dong et al., 2014), question-answering (Iyyer et al., 2014), relation classification (Socher et al., 2012; Hashimoto et al., 2013), and discourse (Li and Hovy, 2014).

One possible advantage of recursive models is their potential for capturing long-distance dependencies: two tokens may be structurally close to each other, even though they are far apart in the word sequence. For example, a verb and its corresponding direct object can be far apart in terms of tokens if many adjectives lie in between, but they are adjacent in the parse tree (Irsoy and Cardie, 2013). But we don’t know if this advantage is truly important, and if so for which tasks, or whether other issues are at play. Indeed, the reliance of recursive models on parsing is also a potential disadvantage, given that parsing is relatively slow, domain-dependent, and can be error-prone.

On the other hand, recent progress in multiple subfields of neural NLP has suggested that recurrent nets may be sufficient to deal with many of the tasks for which recursive models have been proposed. Recurrent models without parse structures have shown good results in sequence-to-sequence generation (Sutskever et al., 2014) for machine translation (e.g., Kalchbrenner and Blunsom, 2013; Luong et al., 2014), parsing (Vinyals et al., 2014), and sentiment, where for example recurrent-based paragraph vectors (Le and Mikolov, 2014) outperform recursive models (Socher et al., 2013) on the Stanford Sentiment Treebank dataset.

Our goal in this paper is thus to investigate a number of tasks in order to understand for which kinds of problems recurrent models may be sufficient, and for which kinds recursive models offer specific advantages. We investigate four tasks with different properties.

• Binary sentiment classification at the sentence level (Pang et al., 2002) and phrase level (Socher et al., 2013), which focuses on understanding the role of recursive models in dealing with semantic compositionality in various scenarios, such as different lengths of inputs and whether or not supervision is comprehensive.

• Phrase Matching on the UMD-QA dataset (Iyyer et al., 2014) can help see the difference between outputs from intermediate components from different models, i.e., representations for intermediate parse tree nodes and outputs from recurrent models at different time steps. It also helps see whether parsing is useful for finding similarities between question sentences and target phrases.

• Semantic Relation Classification on the SemEval-2010 (Hendrickx et al., 2009) data can help understand whether parsing is helpful in dealing with long-term dependencies, such as relations between two words that are far apart in the sequence.

• Discourse parsing (RST dataset) is useful for measuring the extent to which parsing improves discourse tasks that need to combine meanings of larger text units. Discourse parsing treats elementary discourse units (EDUs) as basic units to operate on, which are usually short clauses. The task also sheds light on the extent to which syntactic structures help acquire short text representations.

The principal motivation for this paper is to understand better when, and why, recursive models are needed to outperform simpler models by enforcing apples-to-apples comparison as much as possible. This paper applies existing models to existing tasks and offers few novel algorithms or tasks. Our goal is rather an analytic one, to investigate different versions of recursive and recurrent models. This work helps understand the limitations of both classes of models, and suggests directions for improving recurrent models.

The rest of this paper is organized as follows: We detail versions of recursive/recurrent models in Section 2, present the tasks and results in Section 3, and conclude with discussions in Section 4.

2 Recursive and Recurrent Models 

2.1 Notations 

We assume that the text unit $S$, which could be a phrase, a sentence or a document, is comprised of a sequence of tokens/words: $S = \{w_1, w_2, \ldots, w_{N_S}\}$, where $N_S$ denotes the number of tokens in $S$. Each word $w$ is associated with a $K$-dimensional vector embedding $e_w = \{e_w^1, e_w^2, \ldots, e_w^K\}$. The goal of recursive and recurrent models is to map the sequence to a $K$-dimensional $e_S$, based on its tokens and their corresponding embeddings.

Standard Recurrent/Sequence Models A recurrent network successively takes word $w_t$ at step $t$, combines its vector representation $e_t$ with the previously built hidden vector $h_{t-1}$ from time $t-1$, calculates the resulting current embedding $h_t$, and passes it to the next step. The embedding $h_t$ for the current time $t$ is thus:

$$h_t = f(W \cdot h_{t-1} + V \cdot e_t) \tag{1}$$

where $W$ and $V$ denote compositional matrices. If $N_S$ denotes the length of the sequence, $h_{N_S}$ represents the whole sequence $S$.

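As an illustration of Equation 1, the following is a minimal sketch in plain Python. The zero initial state, the choice of tanh for $f$, and the toy parameter values are assumptions made for the example; none of them come from the paper.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def recurrent_encode(embeddings, W, V, f=math.tanh):
    """Apply h_t = f(W h_{t-1} + V e_t) left to right and return h_{N_S}."""
    K = len(W)
    h = [0.0] * K  # h_0 is taken to be the zero vector (an assumption)
    for e_t in embeddings:
        pre = [a + b for a, b in zip(matvec(W, h), matvec(V, e_t))]
        h = [f(x) for x in pre]
    return h


# Toy example with K = 2 and made-up parameters.
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0]]
sentence = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]]
h_final = recurrent_encode(sentence, W, V)
```

The returned vector plays the role of $h_{N_S}$, the representation of the whole sequence.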
Standard recursive/Tree models Standard recursive models work in a similar way, but process neighboring words in parse tree order rather than sequence order. They compute a representation for each parent node based on its immediate children, recursively in a bottom-up fashion, until reaching the root of the tree. For a given node $\eta$ in the tree and its left child $\eta_{\text{left}}$ (with representation $e_{\eta_{\text{left}}}$) and right child $\eta_{\text{right}}$ (with representation $e_{\eta_{\text{right}}}$), the standard recursive network calculates $e_\eta$ as follows:

$$e_\eta = f(W \cdot e_{\eta_{\text{left}}} + V \cdot e_{\eta_{\text{right}}}) \tag{2}$$

Bidirectional Models (Schuster and Paliwal, 1997) add bidirectionality to the recurrent framework, where embeddings for each time step are calculated both forward and backward:

$$h_t^{\rightarrow} = f(W^{\rightarrow} \cdot h_{t-1}^{\rightarrow} + V^{\rightarrow} \cdot e_t)$$
$$h_t^{\leftarrow} = f(W^{\leftarrow} \cdot h_{t+1}^{\leftarrow} + V^{\leftarrow} \cdot e_t) \tag{3}$$

Normally, final representations for sentences can be obtained either by concatenating the vectors calculated from both directions, $[e_1^{\leftarrow}, e_{N_S}^{\rightarrow}]$, or by using a further compositional operation to preserve vector dimensionality:

$$h_t = f(W_L \cdot [h_t^{\leftarrow}, h_t^{\rightarrow}]) \tag{4}$$

where $W_L$ denotes a $K \times 2K$ dimensional matrix.

Long Short Term Memory (LSTM) LSTM models (Hochreiter and Schmidhuber, 1997) are defined as follows: given a sequence of inputs $X = \{x_1, x_2, \ldots, x_{n_X}\}$, an LSTM associates each timestep with an input, memory and output gate, respectively denoted as $i_t$, $f_t$ and $o_t$. We notationally disambiguate $e$ and $h$: $e_t$ denotes the vector for individual text units (e.g., word or sentence) at time step $t$, while $h_t$ denotes the vector computed by the LSTM model at time $t$ by combining $e_t$ and $h_{t-1}$. $\sigma$ denotes the sigmoid function. The vector representation $h_t$ for each time-step $t$ is given by:

$$\begin{bmatrix} i_t \\ f_t \\ o_t \\ l_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} W \cdot \begin{bmatrix} h_{t-1} \\ e_t \end{bmatrix} \tag{5}$$

$$c_t = f_t \cdot c_{t-1} + i_t \cdot l_t \tag{6}$$

$$h_t^s = o_t \cdot c_t \tag{7}$$

where $W \in \mathbb{R}^{4K \times 2K}$. Labels at the phrase/sentence level are predicted from the representations output at the last time step.

Tree LSTMs Recent research has extended the LSTM idea to tree-based structures (Zhu et al., 2015; Tai et al., 2015) that associate memory and forget gates to nodes of the parse trees.

Bi-directional LSTMs These combine bi-directional models and LSTMs.


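To make the difference in operation order concrete, the bottom-up composition of Equation 2 can be sketched in plain Python as below. The nested-tuple tree encoding, the tanh nonlinearity, and all embedding and matrix values are hypothetical choices for illustration. The left-branching tree stands in for the sequential order (((the food) is) delicious); it is not a full recurrent model, which would also carry an initial hidden state.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def compose(tree, emb, W, V, f=math.tanh):
    """Return e_eta for a binarized tree given as nested 2-tuples of words."""
    if isinstance(tree, str):  # leaf: look up the word embedding
        return emb[tree]
    left, right = tree
    e_left = compose(left, emb, W, V, f)
    e_right = compose(right, emb, W, V, f)
    pre = [a + b for a, b in zip(matvec(W, e_left), matvec(V, e_right))]
    return [f(x) for x in pre]


# Made-up 2-dimensional embeddings and composition matrices.
emb = {
    "the": [0.1, 0.0],
    "food": [0.3, 0.2],
    "is": [0.0, 0.1],
    "delicious": [0.5, 0.6],
}
W = [[0.5, 0.1], [0.0, 0.5]]
V = [[0.5, 0.0], [0.1, 0.5]]

tree_order = (("the", "food"), ("is", "delicious"))
chain_order = ((("the", "food"), "is"), "delicious")
e_tree = compose(tree_order, emb, W, V)
e_chain = compose(chain_order, emb, W, V)
```

With the same parameters, the two bracketings generally yield different root vectors, which is the sense in which the parse tree changes the representation.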
3 Experiments 

In this section, we detail our experimental settings and results. We consider the following tasks, each representative of a different class of NLP tasks.

1. Binary sentiment classification on the Pang et al. (2002) dataset. This addresses the issues where supervision only appears globally after a long sequence of operations.

2. Sentiment Classification on the Stanford Sentiment Treebank (Socher et al., 2013): comprehensive labels are found for words and phrases where local compositionality (such as from negation, mood, or others cued by phrase-structure) is to be learned.

3. Sentence-Target Matching on the UMD-QA dataset (Iyyer et al., 2014): Learns matches between target and components in the source sentences, which are parse tree nodes for recursive models and different time-steps for recurrent models.

4. Semantic Relation Classification on the SemEval-2010 task (Hendrickx et al., 2009). Learns long-distance relationships between two words that may be far apart sequentially.

5. Discourse Parsing (Li et al., 2014; Hernault et al., 2010): Learns sentence-to-sentence relations based on calculated representations.

In each case we followed the protocols described in the original papers. We first group the algorithm variants into two groups as follows:



• Standard tree models vs standard sequence models vs standard bi-directional sequence models.

• LSTM tree models vs LSTM sequence models vs LSTM bi-directional sequence models.

We employed standard training frameworks for neural models: for each task, we used stochastic gradient descent using AdaGrad (Duchi et al., 2011) with minibatches (Cotter et al., 2011). Parameters are tuned using the development dataset if available in the original datasets or from cross-validation if not. Derivatives are calculated from standard back-propagation (Goller and Kuchler, 1996). Parameters to tune include the minibatch size, the learning rate, and the L2 penalty coefficients. The number of running iterations is treated as a parameter to tune, and the model achieving the best performance on the development set is used as the final model to be evaluated.

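For readers unfamiliar with AdaGrad, the per-parameter update can be sketched as follows. This is the textbook rule on a toy one-dimensional objective, not the training code used in the experiments; the learning rate, the `eps` constant and the objective are assumptions chosen for the example, and minibatching and L2 penalties are omitted.

```python
import math


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; mutates params and accum in place and returns them."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum


# Toy problem: minimise f(x) = (x - 3)^2, whose gradient is 2 (x - 3).
x, acc = [0.0], [0.0]
for _ in range(500):
    grad = [2.0 * (x[0] - 3.0)]
    adagrad_step(x, grad, acc, lr=0.5)
```

Because each parameter's step is divided by the root of its own accumulated squared gradients, frequently updated parameters receive progressively smaller steps.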
For settings where no repeated experiments are performed, the bootstrap test is adopted for statistical significance testing (Efron and Tibshirani, 1994). Test scores that achieve a significance level of 0.05 are marked by an asterisk (*).

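The paper does not spell out which bootstrap variant was used. As one common possibility, a paired bootstrap over per-example correctness can be sketched as below; the function name, the number of resamples and the toy data are all hypothetical.

```python
import random


def paired_bootstrap_p(correct_a, correct_b, n_resamples=2000, seed=0):
    """Fraction of resampled test sets on which system A does not beat system B."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample test-set indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy per-example outcomes (1 = correct, 0 = wrong) for two systems.
system_a = [1] * 80 + [0] * 20
system_b = [1] * 60 + [0] * 40
p_value = paired_bootstrap_p(system_a, system_b)
```

A small returned value indicates that A's advantage over B persists across resampled test sets; under this sketch, values below 0.05 would be the ones marked with an asterisk.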
3.1 Stanford Sentiment TreeBank 

Task Description We start with the Stanford Sentiment TreeBank (Socher et al., 2013). This dataset contains gold-standard labels for every parse tree constituent, from the sentence to phrases to individual words.

Of course, any conclusions drawn from implementing sequence models on a dataset that was based on parse trees may have to be weakened, since sequence models may still benefit from the way that the dataset was collected. Nevertheless we add an evaluation on this dataset because it has been a widely used benchmark dataset for neural model evaluations.

For recursive models, we followed the protocols in Socher et al. (2013), where node embeddings in the parse trees are obtained from recursive models and then fed to a softmax classifier. We transformed the dataset for recurrent model use as illustrated in Figure 1. Each phrase is reconstructed from parse tree nodes and treated as a separate data point. As the treebank contains 11,855 sentences with 215,154 phrases, the reconstructed dataset for recurrent models comprises 215,154 examples. Models are evaluated at both the phrase level (82,600 instances) and the sentence root level (2,210 instances).

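The transformation of a labeled parse tree into separate sequence examples can be sketched as follows. The tuple-based tree encoding and the sentiment labels are invented for illustration and do not reflect the treebank's actual file format.

```python
def collect_phrases(node, out):
    """Append (phrase, label) for every constituent under node; return its tokens.

    A leaf is (label, word); an internal node is (label, child, child, ...).
    """
    label = node[0]
    if len(node) == 2 and isinstance(node[1], str):
        tokens = [node[1]]
    else:
        tokens = []
        for child in node[1:]:
            tokens.extend(collect_phrases(child, out))
    out.append((" ".join(tokens), label))
    return tokens


# Hypothetical labeled tree for "the food is delicious" (labels are made up).
tree = (3,
        (2, (2, "the"), (2, "food")),
        (4, (2, "is"), (4, "delicious")))
examples = []
collect_phrases(tree, examples)
```

Each constituent, from single words up to the full sentence, becomes one training example for the sequence models, which is how a treebank of sentences expands into a much larger set of phrase-level data points.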


              Fine-Grained      Binary
Tree          0.433             0.815
Sequence      0.420 (-0.013)    0.807 (-0.007)
P-value       0.042*            0.098
Bi-Sequence   0.435 (+0.08)     0.816 (+0.002)
P-value       0.078             0.210

Table 1: Test set accuracies on the Stanford Sentiment Treebank at root level.



              Fine-Grained      Binary
Tree          0.820             0.860
Sequence      0.818 (-0.002)    0.864 (+0.004)
P-value       0.486             0.305
Bi-Sequence   0.826 (+0.06)     0.862 (+0.002)
P-value       0.148             0.450

Table 2: Test set accuracies on the Stanford Sentiment Treebank at phrase level.

Results are shown in Tables 1 and 2¹. When comparing the standard version of tree models to sequence models, we find that tree structure helps a bit at root level identification (relative to sequence models but not bi-directional sequence models), but yields no significant improvement at the phrase level.

LSTM Tai et al. (2015) found that LSTM tree models perform better than sequence models in sentence root level evaluation. We explore this task a bit more by training deeper and more sophisticated models. We examine the following three models:

1. Tree-structured LSTM models (Tai et al., 2015)².

2. Deep Bi-LSTM sequence models (denoted as Sequence) that treat the whole sentence as just one sequence.

3. Deep Bi-LSTM hierarchical sequence models (denoted as Hierarchical Sequence) that first slice the sentence into a sequence of sub-sentences by using a look-up table of punctuation marks (i.e., comma, period, question mark and exclamation mark). The representation for each sub-sentence is first computed separately, and another level of sequence LSTM (one-directional) is then used to join the sub-sentences. Illustrations are shown in Figure 2.

¹ The performance of our implementations of recursive models is not exactly identical to that reported in Socher et al. (2013), but the relative difference is around 1% to 2%.

² Tai et al. achieved 0.510 accuracy in terms of fine-grained evaluation at the root level as reported in (Tai et al., 2015), similar to results from our implementations (0.504).

Figure 1: Transforming Stanford Sentiment Treebank to Sequences for Sequence Models.

[Figure 2 panels: Standard Bi-directional Model; Hierarchical Bi-directional Model]

Figure 2: Illustration of two sequence models. A, B, C, D denote clauses or sub-sentences separated by punctuation.

We consider the third model because the dataset used in Tai et al. (2015) contains long sentences and the evaluation is performed only at the sentence root level. Since a parsing algorithm will naturally break long sentences into sub-sentences, we’d like to know whether any performance boost is introduced by the intra-clause parse tree structure or just by this broader segmentation of a sentence into clause-like units; this latter advantage could be approximated by using punctuation-based approximations to clause boundaries.

We run 15 iterations for each algorithm. Parameters are harvested at the end of each iteration; those performing best on the dev set are used on the test set. The whole process takes roughly 15-20 minutes on a single GPU machine³. For a more convincing comparison, we did not use the bootstrap test, where parallel examples are generated from the same dataset. Instead, we repeated the aforementioned procedure for each algorithm 20 times and report accuracies with standard deviation in Table 3.
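The punctuation-based segmentation used by the hierarchical sequence model can be sketched as below. Keeping each punctuation token at the end of the clause it closes is an assumption of this sketch, as are the function and variable names; the encoding of each clause and the joining LSTM are not shown.

```python
# Look-up table of clause-ending punctuation: comma, period,
# question mark and exclamation mark.
PUNCTUATION = {",", ".", "?", "!"}


def split_clauses(tokens):
    """Split a token list into clause-like units at punctuation tokens."""
    clauses, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in PUNCTUATION:
            clauses.append(current)
            current = []
    if current:  # trailing tokens with no closing punctuation
        clauses.append(current)
    return clauses


review = "simple as the plot was , i still like it a lot".split()
clauses = split_clauses(review)
```

Each resulting clause would then be encoded separately, and a second, one-directional sequence model would run over the clause representations.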

Model           all-fine      root-fine     root-coarse
Tree LSTM       83.4 (0.3)    50.4 (0.9)    86.7 (0.5)
Bi-Sequence     83.3 (0.4)    49.8 (0.9)    86.7 (0.5)
Hier-Sequence   82.9 (0.3)    50.7 (0.8)    86.9 (0.6)

Table 3: Test set accuracies on the Stanford Sentiment Treebank with deviations. For our experiments, we report accuracies over 20 runs with standard deviation.

Tree LSTMs are equivalent to or marginally better than the standard bi-directional sequence model (two-tailed p-value equals 0.041*, and only at the root level, with p-value for the phrase level at 0.376). The hierarchical sequence model achieves the same performance with a p-value of 0.198.

³ Tesla K40m, 2880 Cuda cores.

Discussion The results above suggest that clausal segmentation of long sentences offers a slight performance boost, a result also supported by the fact that very little difference exists between the three models for phrase-level sentiment evaluation. Clausal segmentation of long sentences thus provides a simple approximation to parse-tree based models.

We suggest a few reasons for the slightly better performance introduced by clausal segmentation:

1. Treating clauses as basic units (to the extent that punctuation approximates clauses) preserves the semantic structure of text.

2. Semantic compositions such as negations or conjunctions usually appear at the clause level. Working on clauses individually and then combining them models inter-clause compositions.

3. Errors are back-propagated to individual tokens using fewer steps in hierarchical models than in standard models. Consider a movie review “simple as the plot was , i still like it a lot”. With standard recurrent models it takes 12 steps before the prediction error gets back to the first token “simple”:

error → lot → a → it → like → still → i → , → was → plot → the → as → simple

In a hierarchical model, the second clause is compacted into one component, and the error propagation is thus given by:

error → second-clause → first-clause → was → plot → the → as → simple.

Propagation with clause segmentation consists of only 8 operations. Such a procedure thus tends to attenuate the vanishing gradient problem, potentially yielding better performance.

[Figure 3 example sentences, each with a per-token sentiment prediction: “It's definitely not dull”; “He is one of the least compelling variations”; “I like every single minute of this film”; “I did not like the film”]

Figure 3: Sentiment prediction using a one-directional (left to right) LSTM. Decisions at each time step are made by feeding embeddings calculated from the LSTM into a softmax classifier.

3.2 Binary Sentiment Classification (Pang) 

Task Description: The sentiment dataset of Pang et al. (2002) consists of sentences with a sentiment label for each sentence. We divide the original dataset into training (8,101), dev (500) and testing (2,000) sets. No pre-training procedure as described in Socher et al. (2011b) is employed. Word embeddings are initialized using skip-grams and kept fixed in the learning procedure. We trained skip-gram embeddings on the Wikipedia+Gigaword dataset using the word2vec package⁴. Sentence level embeddings are fed into a sigmoid classifier. Performances for 50-dimensional vectors are given in Table 4.
              Standard          LSTM
Tree          0.745             0.774
Sequence      0.733 (-0.012)    0.783 (+0.008)
P-value       0.060             0.136
Bi-Sequence   0.754 (+0.09)     0.790 (+0.016)
P-value       0.058             0.024*

Table 4: Test set accuracies on Pang’s sentiment dataset using Standard model settings.

Discussion Why don’t parse trees help on this task? One possible explanation is the distance of the supervision signal from the local compositional structure. The Pang et al. dataset has an average sentence length of 22.5 words, which means it takes multiple steps before sentiment related evidence comes up to the surface. It is therefore unclear whether local compositional operators (such as negation) can be learned; there is only a small amount of training data (around 8,000 examples) and the sentiment supervision only at the level of the sentence may not be easy to propagate down to deeply buried local phrases.

⁴ https://code.google.com/p/word2vec/

3.3 Question-Answer Matching 

Task Description: In the question-answering dataset QANTA⁵, each answer is a token or short phrase. The task differs from the standard generation-focused QA task: it is formalized as a multi-class classification task that matches a source question with a candidate phrase from a predefined pool of candidate phrases. We give an illustrative example here:

Question: He left unfinished a novel whose title character forges his father’s signature to get out of school and avoids the draft by feigning desire to join. Name this German author of The Magic Mountain and Death in Venice.

Answer: Thomas Mann from the pool of phrases. Other candidates might include George Washington, Charlie Chaplin, etc.

The model of Iyyer et al. (2014) minimizes the distances between answer embeddings and node embeddings along the parse tree of the question. Concretely, let $c$ denote the correct answer to question $S$, with embedding $c$, and let $z$ denote any random wrong answer. The objective function sums over the dot product between the representation for every node $\eta$ along the question parse trees and the answer representations:

$$L = \sum_{\eta \in [\text{parse tree}]} \sum_{z} \max(0, 1 - c \cdot e_\eta + z \cdot e_\eta) \tag{8}$$

where $e_\eta$ denotes the embedding for parse tree node $\eta$ calculated from the recursive neural model.

Here the parse trees are dependency parses following (Iyyer et al., 2014).

By adjusting the framework to recurrent models, we minimize the distance between the answer embedding and the embeddings calculated from each timestep $t$ of the sequence:

$$L = \sum_{t \in [1, N_S]} \sum_{z} \max(0, 1 - c \cdot e_t + z \cdot e_t) \tag{9}$$

At test time, the model chooses the answer (from the set of candidates) that gives the lowest loss score. As can be seen from results presented in Table 5, the difference is only significant for the LSTM setting between the tree model and the sequence model; no significant difference is observed for other settings.

⁵ http://cs.umd.edu/~miyyer/qblearn/. Because the publicly released dataset is smaller than the version used in (Iyyer et al., 2014) due to privacy issues, our numbers are not comparable to those in (Iyyer et al., 2014).

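The margin-based matching objective described above has the same form for tree nodes and for recurrent time steps; only the set of unit embeddings changes. A minimal sketch with made-up two-dimensional vectors follows; the function names and values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_loss(unit_embeddings, correct, wrong_answers):
    """Sum max(0, 1 - c.e + z.e) over all units e and wrong answers z."""
    total = 0.0
    for e in unit_embeddings:
        for z in wrong_answers:
            total += max(0.0, 1.0 - dot(correct, e) + dot(z, e))
    return total


# Units are parse tree nodes (tree model) or time steps (sequence model).
units = [[1.0, 0.0], [0.0, 1.0]]
c = [2.0, 2.0]                      # embedding of the correct answer
wrong = [[0.0, 0.0], [1.0, 3.0]]    # embeddings of sampled wrong answers
loss = matching_loss(units, c, wrong)
```

A term contributes to the loss only when a wrong answer scores within a margin of 1 of the correct answer for some unit; in this toy case only the second unit paired with the second wrong answer does so.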


              Standard          LSTM
Tree          0.523             0.558
Sequence      0.525 (+0.002)    0.546 (-0.012)
P-value       0.490             0.046*
Bi-Sequence   0.530 (+0.007)    0.564 (+0.006)
P-value       0.075             0.120

Table 5: Test set accuracies for UMD-QA dataset.

Discussion The UMD-QA task represents a group of situations where, because we have insufficient supervision about matching (it’s hard to know which node in the parse tree or which timestep provides the most direct evidence for the answer), decisions have to be made by looking at and iterating over all subunits (all nodes in parse trees or timesteps). Similar ideas can be found in pooling structures (e.g. Socher et al. (2011a)).

The results above illustrate that for tasks where we try to align the target with different source components (i.e., parse tree nodes for tree models and different time steps for sequence models), components from sequence models are able to embed important information, despite the fact that sequence model components are just sentence fragments and hence usually not linguistically meaningful components in the way that parse tree constituents are.

3.4 Semantic Relationship Classification 

Task Description: SemEval-2010 Task 8 (Hendrickx et al., 2009) is to find semantic relationships between pairs of nominals, e.g., in “My [apartment]_e1 has a pretty large [kitchen]_e2” classifying the relation between [apartment] and [kitchen] as component-whole. The dataset contains 9 ordered relationships, so the task is formalized as a 19-class classification problem, with directed relations treated as separate labels; see Hendrickx et al. (2009) and Socher et al. (2012) for details.

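The step from 9 ordered relations to 19 classes can be made explicit with a short enumeration. The relation names and the extra undirected Other class below follow our understanding of the SemEval-2010 Task 8 definition and are not stated in this paper; the label string format is a hypothetical choice.

```python
# Nine directed relation types (names assumed from the task definition).
RELATIONS = [
    "Cause-Effect",
    "Instrument-Agency",
    "Product-Producer",
    "Content-Container",
    "Entity-Origin",
    "Entity-Destination",
    "Component-Whole",
    "Member-Collection",
    "Message-Topic",
]

# Each relation yields two labels, one per argument order.
labels = [
    f"{relation}({first},{second})"
    for relation in RELATIONS
    for first, second in (("e1", "e2"), ("e2", "e1"))
]
labels.append("Other")  # one undirected catch-all class
num_classes = len(labels)
```

Counting both directions of each relation gives 18 labels, and the catch-all class brings the total to 19.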
For the recursive implementations, we follow the neural framework defined in Socher et al. (2012). The path in the parse tree between the two nominals is retrieved, and the embedding is calculated based on recursive models and fed to a softmax classifier⁶. Retrieved paths are transformed for the recurrent models as shown in Figure 4.

Figure 4: Illustration of Models for Semantic Relationship Classification.

Discussion Unlike for earlier tasks, here recursive models yield much better performance than the corresponding recurrent versions for all versions (e.g., standard tree vs. standard sequence, p = 0.004). These results suggest that it is the need to integrate structures far apart in the sentence that characterizes the tasks where recursive models surpass recurrent models. In parse-based models, the two target words are drawn together much earlier in the decision process than in recurrent models, which must remember one target until the other one appears.

⁶ Socher et al. (2012) achieve state-of-the-art performance by combining a sophisticated model, MV-RNN, in which each word is represented by both a matrix and a vector, with human feature engineering. Again, because MV-RNN is difficult to adapt to a recurrent version, we do not employ this state-of-the-art model, adhering only to the general versions of recursive models described in Section 2, since our main goal is to compare equivalent recursive and recurrent models rather than implement the state of the art.

              Standard          LSTM
Tree          0.748             0.767
Sequence      0.712 (-0.036)    0.740 (-0.027)
P-value       0.004*            0.020*
Bi-Sequence   0.730 (-0.018)    0.752 (-0.014)
P-value       0.017*            0.041*

Table 6: Test set accuracies on the SemEval-2010 Semantic Relationship Classification task.

3.5 Discourse Parsing

Task Description: Our final task, discourse parsing based on the RST-DT corpus (Carlson et al., 2003), is to build a discourse tree for a document, based on assigning Rhetorical Structure Theory (RST) relations between elementary discourse units (EDUs). Because discourse relations express the coherence structure of discourse, they presumably express different aspects of compositional meaning than sentiment or nominal relations. See Hernault et al. (2010) for more details on discourse parsing and the RST-DT corpus.

Figure 5: An illustration of discourse parsing. $[e_1, e_2, \ldots]$ denote EDUs (elementary discourse units), each consisting of a sequence of tokens. $[r_{12}, r_{34}, r_{56}]$ denote relationships to be classified. A binary classification model is first used to decide whether two EDUs should be merged and a multi-class classifier is then used to decide the relation type.

Representations for adjacent EDUs are fed into binary classification (whether two EDUs are related) and multi-class relation classification models, as defined in Li et al. (2014). Related EDUs are then merged into a new EDU, the representation of which is obtained through an operation of neural composition based on the previous two related EDUs. This step is repeated until all units are merged.

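The repeated merging of adjacent units can be sketched as a simple loop. The greedy choice of the highest-scoring adjacent pair is an assumption of this sketch, not necessarily the search procedure of Li et al. (2014), and the scoring and composition functions below are toy stand-ins for the binary classifier and the neural composition.

```python
def greedy_merge(units, merge_score, compose):
    """Merge the best-scoring adjacent pair repeatedly until one unit remains."""
    units = list(units)
    merges = []
    while len(units) > 1:
        # Index of the adjacent pair the (hypothetical) classifier prefers.
        best = max(range(len(units) - 1),
                   key=lambda i: merge_score(units[i], units[i + 1]))
        merged = compose(units[best], units[best + 1])
        merges.append((units[best], units[best + 1]))
        units[best:best + 2] = [merged]
    return units[0], merges


# Toy stand-ins: prefer shorter spans, and "compose" by bracketing strings.
def toy_score(left, right):
    return -(len(left) + len(right))


def toy_compose(left, right):
    return f"({left} {right})"


root, merges = greedy_merge(["e1", "e2", "e3"], toy_score, toy_compose)
```

In the actual task the units would be EDU embeddings, and each merge would also trigger the multi-class relation classifier on the merged pair.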
Discourse parsing takes EDUs as the basic units to operate on; EDUs are short clauses, not full sentences, with an average length of 7.2 words. Recursive and recurrent models are applied on EDUs to create embeddings to be used as inputs for discourse parsing. We use this task for two reasons: (1) to illustrate whether syntactic parse trees are useful for acquiring representations for short clauses; (2) to measure the extent to which parsing improves discourse tasks that need to combine the meanings of larger text units.

Models are traditionally evaluated in terms of three metrics, i.e., spans⁷, nuclearity⁸, and identifying the rhetorical relation between two clauses. Due to space limits, we focus only on the last one, rhetorical relation identification, because (1) relation labels are treated as correct only if spans and nuclearity are correctly labeled, and (2) relation identification between clauses offers more insight into a model’s ability to represent sentence semantics. In order to perform a plain comparison, no additional human-developed features are added.



              Standard          LSTM
Tree          0.568             0.564
Sequence      0.572 (+0.004)    0.563 (-0.002)
P-value       0.160             0.422
Bi-Sequence   0.578 (+0.01)     0.575 (+0.012)
P-value       0.054             0.040*

Table 7: Test set accuracies for relation identification on RST discourse parsing data set.

Discussion We see no large differences between equivalent recurrent and recursive models. We suggest two possible explanations. (1) EDUs tend to be short; thus for some clauses, parsing might not change the order of operations on words. Even for those whose orders are changed by parse trees, the influence of short phrases on the final representation may not be great enough. (2) Unlike earlier tasks, where text representations are immediately used as inputs into classifiers, the algorithm presented here adopts additional levels of neural composition during the process of EDU merging. We suspect that neural layers may act as information filters, separating the informational chaff from the wheat, which in turn makes the model a bit more immune to the initial inputs.

4 Discussion 

We compared recursive and recurrent neural models for representation learning on 5 distinct NLP tasks in 4 areas for which recursive neural models are known to achieve good performance (Socher et al., 2012; Socher et al., 2013; Li et al., 2014; Iyyer et al., 2014).

As with any comparison between models, our results come with some caveats: First, we explore the most general or basic forms of recursive/recurrent models rather than various sophisticated algorithm variants. This is because fair comparison becomes more and more difficult as models get complex (e.g., the number of layers, number of hidden units within each layer, etc.). Thus most neural models employed in this work are comprised of only one layer of neural compositions, despite the fact that deep neural models with multiple layers give better results. Our conclusions might thus be limited to the algorithms employed in this paper, and it is unclear whether they can be extended to other variants or to the latest state-of-the-art. Second, in order to compare models “fairly”, we force every model to be trained exactly in the same way: AdaGrad with minibatches, same set of initializations, etc. However, this may not necessarily be the optimal way to train every model; different training strategies tailored for specific models may improve their performances. In that sense, our attempts to be “fair” in this paper may nevertheless be unfair.

⁷ on blank tree structures.

⁸ on tree structures with nuclearity indication.

With these caveats in mind, our conclusions can be summarized as follows:

• In tasks like semantic relation extraction, in which single headwords need to be associated across a long distance, recursive models shine. This suggests that for the many other kinds of tasks in which long-distance semantic dependencies play a role (e.g., translation between languages with significant reordering like Chinese-English translation), syntactic structures from recursive models may offer useful power.

• Tree models tend to help more on long sequences than shorter ones with sufficient supervision: tree models slightly help root level identification on the Stanford Sentiment Treebank, but do not help much at the phrase level. Adopting bi-directional versions of recurrent models seems to largely bridge this gap, producing equivalent or sometimes better results.

• On long sequences where supervision is not sufficient, e.g., in Pang et al.’s dataset (supervision only exists on top of long sequences), no significant difference is observed between tree based and sequence based models.

• In cases where tree-based models do well, a simple approximation to tree-based models seems to improve recurrent models to equivalent or almost equivalent performance: (1) break long sentences (on punctuation) into a series of clause-like units, (2) work on these clauses separately, and (3) join them together. This model sometimes works as well as tree models for the sentiment task, suggesting that one of the reasons tree models help is by breaking down long sentences into more manageable units.

• Despite the fact that components (outputs from different time steps) in recurrent models are not linguistically meaningful, they may do as well as linguistically meaningful phrases (represented by parse tree nodes) in embedding informative evidence, as demonstrated in the UMD-QA task. Indeed, recent work in parallel with ours (Bowman et al., 2015) has shown that recurrent models like LSTMs can discover implicit recursive compositional structure.